Pairwise alignment incorporating dipeptide covariation. 
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Abstract 

Motivation: Standard algorithms for pairwise protein sequence 
alignment make the simplifying assumption that amino acid sub- 
stitutions at neighboring sites are uncorrected. This assumption 
allows implementation of fast algorithms for pairwise sequence 
alignment, but it ignores information that could conceivably in- 
crease the power of remote homolog detection. We examine the 
validity of this assumption by constructing extended substitu- 
tion matrixes that encapsulate the observed correlations between 
neighboring sites, by developing an efficient and rigorous algo- 
rithm for pairwise protein sequence alignment that incorporates 
these local substitution correlations, and by assessing the ability 
of this algorithm to detect remote homologies. 
Results: Our analysis indicates that local correlations between 
substitutions are not strong on the average. Furthermore, incor- 
porating local substitution correlations into pairwise alignment 
did not lead to a statistically significant improvement in remote 
homology detection. Therefore, the standard assumption that in- 
dividual residues within protein sequences evolve independently 
of neighboring positions appears to be an efficient and appropri- 
ate approximation. 

Availability: Sequence data, software, and matrixes are freely 
available from |http : //compbio .berkeley . edu71 



1 INTRODUCTION 

Among the most commonly used tools in computational biol- 
ogy are the pairwise protein sequence alignmen t methods, such 
as S SEARCH, FASTA an d BLAST Jsmith _& Watermarl Il98ll 
Pearson AT, in3 Il988t lAltschul et all Il99(t lOurhin et al. 



1998). These algorithms are elegant, efficient and effective 
methods of detecting similarity between closely related pro- 
tein sequences. However, the ability of fast pairwise methods 
to detect homology deteriorates as the divergence between the 
sequences increases. Past the "twilight zone" (20-30% pair- 
wise sequence iden tity), only a small fraction of related pro- 
teins can be found. JSander & Schneideilfl99lllDoohttlelll992t 
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iBrenner et oil Il998t iGreen & Brennerl 120021) . Therefore, in or- 
der to make better use of the vast and increasing amount of avail- 
able biological sequence data, there is an immediate need for 
more sensitive, fast database search methods. 

For the sake of computational efficacy, current pairwise align- 
ment methods make several simplifying assumptions. First, 
amino acid substitutions are assumed to be homogeneous be- 
tween protein families. The most commonly used substitu- 
tion matrices (BLOSU M (THenikoff & Henikofd Il993> and PAM 
foavhoffef all 1 1978)) are thus generic models of protein se- 
quence evolution across all protein sequence families at various 
evolutionary distances. Second, substitutions at a given site are 
assumed to be uncorrelated with those on neighboring sites. That 
is, the likelihood of substituting an amino acid X for amino acid 
Y is assumed to be independent of the sequence context of X. 
It is known that both of these simplifying assumptions intro- 
duce errors into homology searching. Relaxing the assumption 
of homogeneous substitution across protein families can signifi- 
cantl y improve the performance of pairwise alignment methods 
dYuef al.lEool) . Furthermore, alignment methods that remove 
the assumption of homogeneity among different positions in the 
sequence, and instead model the heterogeneity of the given pro- 
tein family, have been fou nd to be dramatic ally superior for re- 
mote homology detection JParkef a/.Lfl99l R. E. Green and S. 
E. Brenner, Unpublished data). Unfortunately, these profile meth- 
ods ( e.g. PS T-BT.AST dAltschul et all Il997l) . HMMER (Eddv, 
l200 lb . SAM JKarplus et al.Ul998l) etc. ) are not tractable for all 
query sequences. They require the presence, identification, and 
correct alignment of homologous sequences in order to generate 
a model of the query sequence's family. Therefore, the fast and 
universally applicable pairwise methods remain widely used for 
database searching, despite their lower sensitivity. 

One proposed strategy for increasing the sensitivity of pair- 
wise alignment is to use a more sophisticated scoring function for 
amino acid substitutions, namely one that is sensitive to the se- 
quence context in which the residue resides. For example, amino 
acid sequences are correlated with secondary structural features, 
such as helixes and loops, whi ch can directly lead to structure de- 
pendent substitution pattern s dThorne et a/.LIl996tlTopham et all 
ll997tlGoldman et q/.LIl998T) . Similarly, one might intuitively ex- 
pect structurally and functionally important residues, such as cys- 
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teines and prolines, to be more or less conserved depending on 
their local sequence environment and the prevalence of particular 
motifs. 

The first large-scale exploration of the effect of sequence con- 
text on a mino acid evolution was performed by Gonnet and co- 
workers ( Gonn et et al ., 1994), who examined the frequencies of 
dipeptide substitutions, and compared them to the dipeptide sub- 
stitution frequencies expected assuming no sequence dependent 
correlations. Despite the fact that nearly half of the elements of 
the 400 x 400 observed dipeptide matrix were vacant (due to the 
sparsity of data) several interesting patterns were evident. The 
chief trend was that amino acids are generally more likely to be 
conserved if they are adjacent to po sitions that are also conserved. 

More recentlv ljung & Lee] J2000h have taken advantage of the 
large increase in available data to reexamine trends in dipeptide 
evolution. They used the observed patterns of substitution within 
a large set of structure-based alignments to generate dipeptide 
substitution matrices. Furthermore, they developed an extension 
to the standard Smith-Waterman alignment algorithm that incor- 
porates a term from these dipeptide matrices. By using sequence 
and structure context information, they show some improvement 
in homolog detection in a limited test set. However, their method 
could not be extensively tested or practically utilized because an 
efficient dynamic programming method for finding the optimal 
alignment was not known to the authors. Instead, they adopted a 
heuristic search that is not guaranteed to find optimal alignments. 

In this study, we have extended the work described above by 
examining the strength of local, dipeptide substitution correla- 
tions using the massive amount of alignment data within the 
BLOCKS database. We have also extended the standard Smith- 
Waterman algorithm to include local dipeptide correlation in- 
formation over a user-defined distance. Like Smith- Waterman, 
this new polynomial time algorithm, doublet, finds the op- 
timal alignment under the scoring scheme described. Using a 
standard remote homolog detection evaluation strategy, we have 
tested doublet against the Smith- Waterman algorithm to mea- 
sure the impact of including this extra information. Perhaps sur- 
prisingly, we found that incorporating doublet substitution corre- 
lations leads to a statistically insignificant difference in homology 
detection. 



2 Methods 

2.1 Quantifying substitution correlations 

Consider two aligned, ungapped sequences, x = x\,X2, ■ ■ • , x n 
and y = yi,y2, ■ ■ ■ ,Vn, both of length n, where each element rep- 
resents one of the 20 canonical amino acids. We wish to use the 
patterns of conservation and variation between these sequences 
to estimate the log odds that the sequences are homologous (i.e., 
that both sequences have descended from a common ancestor). 

S = log g ( X ' y ) = l g. l( Xl > X 2' ' ' ■> X n,yi,y2, ■ ■ ■ ,Vn) 



' p(x)Ky) 



' p(x!,x 2 , ■ ■ ',x n )p(y 1 ,y 2 , ■ ■ ■ ,y n ) 



(1) 



Here, p(x) is the background probability of the given amino acid 
segment and q(x; y) is the target probability of observing the pair 



of segments in diverged homologous sequences. 

Except for very short segments, the background and target 
probability distributions are large and cannot be directly mea- 
sured. Therefore, Eq. [Qis typically simplified by assuming that 
substitutions probabilities are homogeneous (independent of the 
location in the fragment) and that both the substitutions and the 
sequences themselves are uncorrelated from one position to the 
next. Consequentially, the total similarity score is now a sum of 
independent parts, 



S ~ z2 s ( x k;Vk), s(i;j)= log 



P{i)p(j) 



(2) 



The log odds of residue replacement, s(i,j), is an element of a 
standard singlet substitution matrix, of the type widely used in 
pairwise sequence alignment L\ltschuill99ll) . 

This approximation of the full similarity by a sum of singlet 
substitution scores requires that we neglect all inter-site correla- 
tions. We can perform a more controlled approximation by not- 
ing that a homogeneous multivariate probability can be expanded 
into a product of single component distributions, pairwise corre- 
lations, triplets correlations, and so on. 



P(zi,z 2 , ■ ■ ■ , z n ) = JJ P{zi) x Yl 



l<3 



n 

i<j<k 



P{z l ,z 3 ,z k )P{z l )P{z k )P{z j ) 

P(Zi, Zj)P(Zi,Z k )P(Zj,Z k ) 



(3) 



If we assume that substitution probabilities are independent of the 
location within the fragment, then we can apply this expansion to 
the segment homology score (Eq.Q. 



S = ^2s(x k ;y k ) 
k =i 



L n — L 

EE 

i=i fe=i 



di(x k ,x k+ i;y k ,y k+ i) 



(4) 



The first term of this expansion represents single residue replace- 
ments, as in Eq.|3] The next term defines the doublet substitution 
scores, 



log 



qi(i,i';j,f) 

Pi{i,i')Pi(j>f) 



s(i;j)-s(i';f) (5) 



Here, i and i' are residues separated by a distance I along one 
amino acid chain, while j and j' are the corresponding aligned 
residues on the putative homologous sequence; qi(i,i';j,j r ) 
is the target probability of observing this aligned quartet, and 
pi (i, i') is the background probability of this residue pair in pro- 
tein sequences. These doublet scores represent the additional 
similarity due to correlations between substitutions. 

By truncating the expansion of the full similarity score at dou- 
blet terms (Eq. |4jl, we are assuming that triplet and higher or- 
der correlations between substitutions are relatively uninforma- 
tive. For reasons discussed below, this is probably a reasonable 
approximation. Furthermore, the most important inter-site cor- 
relations are between residues neighboring on the chain (Fig. [3}. 
Therefore, we can restrict the maximum distance over which dou- 
blet interactions are scored without serious error. 
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The average simil arity score is the inter-homolog mutual in- 
formation / JCover & Thomas! fl99ll) . a measure of the inter- 
sequence correlations. A high mutual information value indicates 
strong correlation, whereas a mutual information value of zero 
indicates uncorrected variables. Mutual information has various 
advantages as a correlation measure: it is firmly grounded in in- 
formation theory, it is additive for independent contributions and 
it has consistent, intuitive units (bits). 



J(x;y) = 5^ ?(x,y) log 2 



<?(x,y) 

p( x My) 



(6) 



The average singlet score is the inter-homolog mutual informa- 
tion per residue, under the assumption that replacements are un- 
correlated. This is frequently reported as the "relative entropy" 
of the substitution matrix. The average doublet score is the first 
order correction to the inter-sequence mutual-information due 
to inter-site correlations. Consequentially, we may evaluate the 
comparative importance of singlet and doublet contributions to 
the sequence similarity by examining the average contributions 
of these different components to the full inter-homolog mutual 
information. 

The preceding analysis applies to contiguously aligned se- 
quence segments. However, in addition to substitutions, protein 
sequences are modified by the insertion and deletion of residues. 
Since it is not obvious how to capture the existence of indels in 
doublet scores, in the following discussion we assume that dipep- 
tide correlations do not extend across gaps, and we adopt the sim- 
ple and standard affine model of gap lengths. This approximation 
should have little impact, since aligned detectably homologous 
sequences tend to have relatively few indels, particulary in re- 
gions that are significantly similar. 

2.2 Alignment Algorithm 

We have extended the standard Smith-Wat erman optima l local 
sequence alignment algorithm (Smith & Waterman, 11981 1) to in- 
corporate doublet substitution scores (See Fig.Q. The time com- 
plexity of Smith- Waterman is 0(nm), where n and m are the 
lengths of the two sequences. Adding doublet scores increases 
the complexity to 0(nmL), where L is the distance over which 
substitution correlations are scored. This efficient dynamic pro- 
gramming alignment is possible because, although we are scoring 
correlations between residues that are not directly aligned, these 
correlations are local along the chain. The space complexity of 
our implementation is al so Q(nmL)\ this co uld be improved us- 
ing standard techniques (Durb inef a/.[ fT998>. 

The additional similarity score associated with adding the final 
match pair Xi,yj to the alignment contains singlet (S) doublet 
(D) substitution scores; 

S(i,j) = s(xi,yj), (7) 

r 

D (i,j,r) = ^difa-ux^yj-uyj). (8) 
i=i 

Here, r is the length of the preceding contiguous segment of 
aligned residues, or the maximum sequence separation over 
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Figure 1: A comparison of Smith-Waterman and doublet se- 
quence alignment, (a) A Smith- Waterman match table, with the 
optimal alignment highlighted. The value of each cell is the max- 
imum of 1 . the singlet match score (this is the start of an align- 
ment ), 2. the singlet score plus the match score from the previous 
cell along the diagonal (this extends an aligned region), or 3. the 
singlet score plus the optimal score from a gap score table (the 
previous residue was not aligned) (b) For doublet, multiple 
match tables are used (Eas. l9llTl . The number of match tables is 

1 plus the distance over which dipeptide correlation information 
is considered (in this example, 2). Again, the optimal alignment 
is highlighted. The top table corresponds to the starts of aligned 
regions, the middle table corresponds to aligned regions of at least 

2 consecutive residues and the bottom table corresponds aligned 
regions of at least 3 consecutive residues. The alignment path 
through these tables falls through to lower tables in regions of 
conecutive aligned residues and begins again in the top table fol- 
lowing gaps. To extend dipeptide context scoring over longer 
distances requires additional match tables. 



which doublet correlations are scored, whichever is less. Dele- 
tions of length k are weighted with the affine penalty — (<? pen + 
(k — l)g cxt ), where g open and g cx t are positive constants. This 
standard affine gap length model i s both c omputationa lly ef- 
ficie nt and sur prisingly effe ctive. JSmith & Waterman! 1 198 it 
lAltschul & Ericksonlll985lZachariah et q/.ll2005h . 

The optimal, highest scoring alignment between two sequences 
(x = x 1} x 2 , ■■■ ,x n and y = yi,V2,- ■■ , y m ) is found by popu- 
lating a series of score tables, also known as dynamic program- 
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ming matrices. The entries of the match table, M(i,j, r), are the 
maximum alignment score for an alignment that terminates with 
an ungapped segment of length r, ending at the ith position of x, 
and the jth position of y. Similarly, the gap tables G x (i,j) and 
G y contain the maximum alignment similarity given that the 
alignment ends with Xi or yj gapped. The entries of these tables 
can be efficiently computed starting from the following boundary 
conditions: M(i, 0, 1), M(0, j, I), G x/y (i, 0), G x/y {0, j) = -oo. 
A single aligned amino acid pair may signal the beginning of a 
new local alignment, or it may occur immediately after any align- 
ment gap. 



M(i,j,l) 



S(i,j) 

S(i,j) + G x (i- 



(9) 



S(i,j)+Gy(i,j-1) 



In standard Smith- Waterman this is the only necessary match 
score table. However, in doublet we require additional match 
tables so that we may keep track of match scores over extended, 
contiguously aligned regions. Of necessity, longer ungapped seg- 
ments occur only after shorter segments. We restrict the maxi- 
mum distance L over which doublet correlations are scored, since 
we expect that the useful information that can be extracted from 
doublet correlations will decay rapidly with sequence separation 
(See Fig. [3}. Consequentially, we do not need to explicitly con- 
sider ungapped segments of length greater than L + 1. 

M(i,j,2<r<L) = S(i,j) + D(i,j,r-1) (10) 

+ M(i - l,r- 1) 

M(i,j,L + l) = S(i,j) + D(i,j,L) 

M(i-l,j-l,L) 
M(i-l,j -1,L + 1) 

Gaps in the alignment are either preceded by a match or they 
extend an existing gap. 



G x (i,j) 
G y (i,j) 



m;)N M(i-l,j- l,r)-g open 
r =i,L \ G x (i - l,j) - g cxt 

M(i l,r) - g opcn 

r"i,L \ G y (i, j - 1) - g cxt 



(11) 



The largest score within the match table marks the last aligned 
position of the optimal alignment. The full alignment can be 
found by backtracking through the table, according to the choices 
previously made during the scoring step. 

We used the method of iBailev & Gribskovl J20Q2I) to fit an 
extreme value distribution to the results of aligning a query se- 
quence against a database of possible homologs. The maximum 
likelihood parameters are then used to assign E-values to each 
alignment. 

2.3 Doublet BLOcks Substitution Matrix 

A doublet substitution matrix (Eq. [5} contains 20 4 = 160, 000 
entries, of which 20 2 x (20 2 + l)/2 = 80, 200 are unique due 
to the underlying symmetry, di(i, i'\ j, j') = di(j,j';i,i'). To ac- 
curately estimate these scores we require a very large collection 



of reliably aligned protei n sequences. The BLO CKS data base is 
one such resource dHenikoff & Henikofd Il992t iHenikoff etail 
2000). Each database block consists of a reasonably reliable, 
ungapped multiple sequence alignment of a core protein region. 
BLOCKS version 13+ contains 11, 853 blocks, containing, on av- 
erage, 56 segments of average length 26 residues. Overall, about 
10 9 pairwise amino acid comparisons are available for study. 

The widely used canonical BLOcks Substitution Matrixes 
(BLOSU M) were generated from ve rsion 5 of the BLOCKS 
database (Henikoff & Henikoff , 1992). In order to generate a se- 
ries of matrices representing different evolutionary divergences, 
the sequences in each block are clustered at a given level of se- 
quence identity and the inter-cluster sequence correlations are 
collected. Thus BLOSUM 100 (where only 100% identical se- 
quences are clustered) represents a wide range, including low lev- 
els, of evolutionary divergence, whereas BLOSUM30 represents 
only correlations between very diverged sequences. 

In principle, we should match the divergence inherent in the 
substitution ma trix to the divergence of the pair of sequences we 
wish to align ^ Bishop & Thompson, 1986: iThorne et all Il99ll 
Il992t Altschul, 1993). However, this is computationally expen- 
sive, and, in practice, a single matrix is chosen based on its 
ability to align remote homologs, on the grounds that matching 
close homologs is relatively easy jBrennerl Il996l iBrenner et all 
1998; Cro oks & Brenner! 120051) . In a recen t evaluation of re- 
mote pairwise homology dete ction efficacy JGreen & Brenner! 
l2002l IZachariah eTall l2005h . we discovered that the BLO- 
SUM65 substitution matrix, reparameterized from the BLOCKS 
13+ database, was more effective than any other reparameter- 
ized BLOS UM (BLOCKS 13+) classic BLOSUM (BLOCKS 
5) or PAM ( Davho ff et all Il978l) substitution ma trix, and was 
comp arable to the most effective VTML matrix (tMiiller et all 
2002). Consequentially, we have built singlet and doublet sub- 
stitution matrices from the BLOCKS 13+ database at 65% clus- 
tering , using an adaptation of the original BLOSUM clustering 
code (Heni koff & Henikofd ll992T) . This provides approximately 
10 7 - 10 8 independent aligned doublets, depending on the se- 
quence separation I. 

The estimated doublet target frequencies qi(i,i' '; j, j') where 
smoothed and regularized by adding a pseudocount a(i, j') 
to the raw count data, n(i,j';j,j'). These pseudocounts are 
taken to be proportional to the marginal singlet target probabil- 
ities, qi(i;j)qi(i',f). 



A + N 
A x q(i;j)q(i';f) 



(12) 
(13) 



Where, is the total number of counts. Thus, if no data are 
available (the total number of counts is zero, N = 0), then all 
doublet scores would be zero, as can be seen from Eq.[3] Here, 
A is a scale parameter that determines how much data is required 
to overcome the prior probability inherent in the pseudocount. 
Typically, such scale factors are picked empirically. However, in 
this case, we performed a full Bayesian analysis and determined 
that for doublet substitutions reasonable values of A are about 
2 x 10 6 , which can be compared to the 10 7 to 10 8 actual obser- 
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BLOSUM65 (from BLOCKS 13+) 
Singlet Substitutions Doublet Substitutions (Selected entries) 



AA AA 

AD AD 

AD DA 

DA DA 

DD DD 

CA AD 

CA AC 

CA AQ 

PI LF 

PI LP 

PI LS 



-1 16 4 
-2 4 11 



RA AA 

RA AR 

RA AN 

PC CG 

PC CL 

PC CK 

PC CP 



CC CA 

CC CR 

CC CN 

CC CD 

CC CC 

CC CQ 

CC CE 

CC CG 

CC CH 

CC CI 

CC CL 

CC CK 

CC CM 

CC CF 

CC CP 

CC CS 

CC CT 

CC CW 

CC CY 

CC CV 



ET AA 

ET AR 

ET AN 

ET AD 

ET AC 

ET AQ 

ET AE 

ET AG 

ET AH 

ET AI 

ET AL 

ET AK 

ET AM 

ET AF 

ET AP 

ET AS 

ET AT 

ET AW 

ET AY 

ET AV 



Figure 2: BLOSUM65 singlet substitution matrix derived from the BLOCKS 13+ database (left), and selected elements of the 
corresponding doublet substitution matrices (right). Scores are in 1/4 bit units, rounded to the nearest integer. The average standard 
statistical error is about 1/4 bits (i.e. about 1 unit) for the doublet scores, and essentially insignificant for the singlet scores, as judged 
by bootstrap resampling (See Sec. 12.31 The singlet scores are the log odds of observing the given substitution; positive scores are 
more likely, and negative score less likely to be observed than would be expected for uncorrected sequences (Eq. 0. Similarly, 
the doublet scores represent the log odds for observing pairs of substitutions, at various sequence separations, relative to the singlet 
substitutions likelihood (Eq.|5)- For example, the L=3 column for ET AV (bottom right) indicates a score of zero for the alignment 
of ExxT in one sequence to AxxV in the other. 



vations. The full details are given in the supplemental materials. 
A representative subset of a doublet substitution matrix is shown 
in fig.El 

Standard statistical errors were estimated by n on-parametric 
Bayesian bootstrap resampling on sequence blocks ilEfi-onll 19791 
Rubir| Il98ll) . Instead of assigning equal weight to every se- 
quence block, each block is instead given a random weight drawn 
from a Dirichlet distribution. This random reweighting induces 
random changes in the estimated scores, thereby providing an es- 
timate of the statistical errors caused by the finite size and inho- 
mogeneity of the training data. 

2.4 Evaluation of remote homology detection 

We have previously developed and applie d a sensitive s trategy 
for evaluation of database search methods ferenneref artll99a 



rr 



Green & BrenneJ. 120021 Izachariah 1200,4 iPrice et al 

2005). This strategy is made possible by the availability of a 
large collection of protein sequences whose evolutionary inter- 
relations are known (primarily from structural information). In 
our approach, each sequence is aligned against every other se- 
quence, and the alignment scores are used to determine putative 
homologs. We then consider the proportion of correctly identi- 
fied homologs as a function of erroneous matches. Because the 
homology information derives from sequence-independent data, 
we avoid the circularity inherent in other evaluation approaches. 

The collection of related sequences is derived f rom the Struc - 
tural Classification Of Proteins (SCOP) databas e JlVIurzin et al. 



1995). We use the ASTRAL compendium (Chandonia et al. 



2004) of representative subsets of SCOP release 1.61 (Sept. 



2002), filtered so that no two domains share more than 40% se- 
quence identity. We partition every other SCOP fold into sepa- 
rate test and training subsets of approximately equal size, each 
containing about 550 superfamilies, 2500 sequences, and 50,000 
homologous sequence pairs. To avoid over-fitting, adjustable pa- 
rameters are optimized using the training set. Results of an all- 
versus-all comparison of the test set, using these optimized pa- 
rameters, are reported as a plot of coverage (fraction of true re- 
lations found) versus errors per query (EPQ), the total number of 
false relations divided by the number of sequences (See Fig. [4}. 
The raw, unnormalized coverage is the fraction of all true rela- 
tions that are found. 

Since the number of relations within a superfamily scales as the 
square of the size of the superfamily, and because SCOP super- 
families vary greatly in size, this reported coverage is dominated 
by the ability to detect relations within the largest superfamilies. 
To compensate for this unwarranted dependence, we also report 
the average fraction of true relations per sequence (linear nor- 
malization) and the average fraction of true relations per super- 
family (quadratic normalization). In general, large superfamilies 
are more diverse, and the relation ships within them are harder to 
discover JGreen & BrennerlEo 02). Thus, unnormalized coverage 
is typically less than the linearly normalized coverage, which in 
turn is less than quadratically normalized coverage. One impor- 
tant point of comparison for search results is 0.01 errors per query 
rate for linearly normalized results, the average fraction of true 
relations per database query at a false positive rate of 1 in 100. 
We report the observed difference in coverage of two methods 
at this selected EPQ, and determine standard statistical err ors and 
confidence intervals using Bayesian bootstrap resampling ( Rubin, 
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1 2 3 4 5 6 

L 

Figure 3: The inter-sequence mutual information of homologs 
encoded in inter-site correlations at increasing separation, L. In 
other words, the average doublet substitution scores (Eq.|6j. The 
top, dark line is the total information at various sequence sepa- 
rations. For comparison, the information encoded in the corre- 
sponding singlet substitutions (the average singlet matrix score) 
is 0.31 bits per residue. The remaining lines illustrate the rela- 
tive contributions of different substitutions classes to this total in- 
formation; these are exact conservation XY^XY, partial conser- 
vation XY^XZ, swaps XY^>YX, partial swaps XY^-ZX, and 
unconserved, double substitutions XY^ZU. 



ll98lUPrice e7aILl2005t> . 

3 RESULTS 

3.1 Doublet Substitution Correlations. 

Various trends are evident within the doublet score matrix, as il- 
lustrated in fig|2] Notably, exact conservations, such as AA^->AA, 
AD^AD and DD^DD, etc., generally have positive scores. 
This is expected because the pairs of sequences used to build the 
BLOSUM matrices have a variety of inter-sequence similarity, 
ranging from mostly conserved to very diverged. Thus the ob- 
servation of a conserved residue suggests that the sequences are 
relatively undiverged, and therefore that other aligned residues 
are also more likely than average to be conserved. 

Also notable is that many (but far from all) exact swaps, such 
as DA^AD, are significantly more likely that expected. Possi- 
bly, this is because the effect of a deleterious mutation X— >Y can 
sometimes be ameliorated by the occurrence of the corresponding 
mutation Y— >X, in the immediate sequence neighborhood. Par- 
tial swaps, where only one of the substitution pair is conserved, 
are also often positive. This might reflect alignment errors in the 
original dataset. The most highly positive scores (and therefore 
those events that are most over-represented in the data relative to 
uncorrelated substitutions) are associated with the substitutions 
PC^-Cx, i.e., a translocation of a cystine, replacing a proline. 
The most relatively uncommon substitutions involve the muta- 
tion of one cystine in the cystine pair CxxC (second column), 



a widespread and important motif found, for example, in the 
thioredoxin family. However, these interesting particular cases 
are atypical. Most of the doublet substitution matrix is similar 
to the ET^Ax substitutions displayed in the third column; the 
majority of the scores are not significantly different from zero, 
indicating that most possible substitution doublets are essentially 
uncorrelated. 

We can place the above observations on a quantitative footing 
by considering the inter-sequence mutual information (Eq. |6j, a 
measure of the correlation strength between aligned homologous 
sequences. The first order contribution is equal to the average sin- 
glet score, which is 0.3 1 bits per aligned residue for BLOSUM65 
(BLOCKS 13+). The corresponding average doublet score, the 
additional information encoded in inter-site substitution covari- 
ation, is around 0.04 bits at modest sequence separations (illus- 
trated in fig.|3). Thus, the inter-site substitution correlations carry 
relatively little information. However, these correlations appear 
to persist to non-local neighbors, which suggests that the total 
information from interactions at all sequence separations is sub- 
stantial. However, figure[3]also displays the contributions to this 
total information from various categories of substitution. The 
largest contribution, and the only contribution to persist above 
a sequence separation of 4 residues, represents exactly conserved 
pairs of residues. This is a rather trivial correlation which is per- 
sistent because all parts of two homologous sequences have the 
same chronological divergence. All other substitution classes, 
summing over all sequence separations, contribute no more than 
0.1 bits per residue. This is not entirely insignificant, but it is 
still small compared to the singlet mutual information. Thus non- 
trivial correlations between substitutions are relatively weak. 

3.2 Homology Detection 

The primary use for pairwise alignment methods is to search 
databases of previously characterized biological sequences for 
homologs of the sequence of interest. Therefore, the most power- 
ful methods will perform this task most effectively by assigning 
true homologs significant statistical scores and assigning unre- 
lated sequences low statistical scores. Our assessment methodol- 
ogy compares database search methods on this criteria. 

We compared the doublet alignment algorithm against the 
standard Smith- Waterman algorithm. To perform a fair test, we 
converted raw scores to statistical scores for both algorithms us- 
ing the same length normalize d maximum likelihood EV P pa- 
rameter determination method (B ailev & Gribskovl Eo02). Op- 
timal parameters for gapping, matrix scaling, and distance over 
which to consider dipeptide correlations were found using the 
training database described above. Then, the algorithms were 
evaluated by comparing the relative ability to detect remote ho- 
mologs within the test dataset, using the parameters optimized on 
the training dataset. (Insert, fig.EJ. 

The results of a database search for Smith- Waterman and 
doublet, using only nearest neighboring dipetide covariations, 
are shown in Fig.|^. Both the Smith- Waterman and doublet 
methods performed remarkably similarly over all error rates and 
normalization schemes. The linearly normalized coverage at 



6 



(a) 



Coverage versus errors per query 
< 40% sequence identity remote homologs 



10 



0.1 



2 0.01 

LU 

0.001 
0.0001 

(b) 

10 



CD 

D- 0.1 



in 

2 0.01 



0.001 



0.0001 



■° / 
s / 

<o/ 

o / 
c / 




$/$ " 


Doublet: 

mlth-Waterrrian — — 




Y° < 


ff c 

Error rate at which 








bootstrap test was performed 















0.2 



0.4 0.6 
Coverage 



0.8 



< 30% sequence identity remote homologs 




Doublet I 
Smith-Waterman ■ 



12 3 
Matrix Scale 65 60 60 60 

gopen 17 17 18 20 

gext 2 2 2 1 

Coverage (0.01 EPQ) 0.236 0.229 0.225 0.223 



0.2 



0.4 0.6 
Coverage 



0.8 



Figure 4: These coverage versus errors per query plots show that 
including dipeptide covariation information in alignment deter- 
mination (doublet) does not improve remote homolog detec- 
tion, (a) Optimized matrix, gap and look-back parameters were 
used to search the test database with the doublet and Smith- 
Waterman algorithms. This database contains no sequence pairs 
that share more than 40% sequence identity. The number of cor- 
rectly identified homologs is shown as a function of the number 
of errors made. Smith-Waterman outperforms doublet over 
all but extremely low error-rates, (b) Remote homolog test us- 
ing only sequence pairs with less than 30% sequence identity. 
As above, Smith- Waterman correctly identifies more remote ho- 
mologs than the doublet algorithm. Insert: Optimal matrix 
scale parameter, gap parameters, and corresponding linearly nor- 
malized homology detection coverage at 0.01 EPQ, as a function 
of the covariation distance considered, L 



0.01 errors per query was slightly higher for Smith- Waterman 
than doublet (Insert, fig.|4}. From this, we conclude that in- 
cluding dipeptide covariation information does not improve re- 
mote homology detection and, in fact, slightly degrades perfor- 
mance at this error rate. We also performed the same coverage 
versus errors per query analysis using only sequences with less 
than 30% sequence identity (Fig. |4j)), as it was previously re- 
ported that dipeptide covariation information may be useful only 
for detecting these extremely remote evolutionary relationships 
(Jung & LeeL l2000). Our results, however, show that even at this 
evolutionary distance, dipeptide covariation scoring does not im- 
prove homology detection. 

We used Bayesian bootstrap resampling to estimate statistical 
errors and to determine if the observed coverage difference was 
statistically significant. We found that a 95% confidence interval 
for the coverage difference at 0.01 errors per query comfortably 
contained zero difference. Therefore, we cannot distinguish be- 
tween the remote homolog detection abilities of Smith- Waterman 
and doublet. 

We also evaluated the effect of including covariation informa- 
tion over larger sequence separations. As can be seen in table 
of fig.|4] incorporating this additional information into alignment 
scores actually results in a slow degradation of homology detec- 
tion efficacy. 

4 DISCUSSION 

We have developed, implemented, and tested an alignment al- 
gorithm, doublet, that generates the optimal pairwise protein 
sequence alignment under a scoring scheme that includes dipep- 
tide covariation information. Perhaps surprisingly, and in marked 
contrast to previous reports, we found that using this informa- 
tion provides no benefit to remote homolog detection. The per- 
formance of the doublet algorithm for detecting remote ho- 
mologs is statistically indistinguishable from the standard Smith- 
Waterman algorithm. 

The underlying explanation for this indifference of alignment 
to dipeptide covariation is that substitution correlations are weak 
on the average (Figs. |2] and [5). Therefore, the average effect of 
these interactions is insignificant and including covariation in se- 
quence alignment makes very little material difference to remote 
homology detection. 

We might reasonably question if the training data is at fault. 
Indeed, the slight degradation of homology detection as more 
distant correlations are included (Insert table, fig. |4ji does indi- 
cate that the doublet substitution matrices contain anomalies, per- 
haps due to the training or alignment of the BLOCKS sequences, 
or perhaps because of the different sampling of sequences in- 
cluded in BLOCKS compared to those included in SCOP. The 
BLOCKS database that we use to train the doublet substitution 
matrices contains ungapped alignments, many of shorter length 
than the average SCOP protein domain. Fikami-kobayashi and 
co-workers showed that the covariation sign al is strongest w ithin 
single secondary structure elements ( Fuk ami-Kobavashi et all 
2002). The poor performance of doublet, then, may be due 
to its applying the covariation model too bluntly across entire 
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protein sequences when it is only applicable within secondary 
structure elements. However, we note that the BLOCKS database 
has b een used to derive very effective singlet substitution matri- 
ces (jGreen & Brennen 12002). and therefore it is implausible that 
the substitution signals within the BLOCKS database are sub- 
stantially erroneous. Rather, the observed degradation simply re- 
inforces the idea that neighboring substitutions are weakly cor- 
related, particularly when compared to single substitution corre- 
lations, and therefore the doublet signal is readily degraded by 
minor anomalies in the data. 

Another line of evidence comes from examining the inter- 



site amino acid correlation of single protein sequences (Yeas, 



1958; Weiss et al, 2000; Crooks & Brenner, 2004; Crooks et al. 



2004). Neighboring amino acids are almost entirely uncorre- 
lated; the nearest neighb or mutual infor mation has been estimate 
as only 0.006 bits JCrooks & Brenner! |2004 . This lack of se- 
quence correlation is consistent with (but does not require) small 
inter-site substitution correlations. 

In should be emphasized, however, that the observation of 
weak average dipeptide covariation does not negate the possibil- 
ity of strong, interesting covariation in particular instances, such 
as CP^Cx, or within particular families. Moreover, it is con- 
ceivable that covariation information could be used more judi- 
ciously, thereby improving alignment results. For example, as 
previously discussed, one might include doublet-type scoring in- 
formation only for residue pairs that are likely to be within the 
same secondary structural element. Similarly, one might exam- 
ine the covariation of residues that are pr oximate in the tertiary 
struct u re, rather t han al ong the sequence (Rodionov & J ohnson. 
1 19941 iLin et al.i 120031) . However, residu es that are proxi- 
mate in space are al so only weakly correlated! C line et all 120021 
ICrooks et al\ [2004), and the inter-residue mutual information is 



not improved by foreknowledg 


e of the local structure environ- 


ment (Crooks & Brenner, 


2004; 


ICrooks et aZ.L 120041 Therefore, 



we suspect that such approaches will also not have dramatic ef- 
fects on protein sequence alignment. 

In conclusion, the ubiquitous assumption that neighboring sites 
along a protein sequence evolve independently appears to be gen- 
erally appropriate. This leads to fast, elegant and effective algo- 
rithms for protein sequence alignment and homology detection. 
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APPENDIX: Estimating probabilities 
from counts with a prior of uncertain 
reliability. 

A common problem is that of estimating a discrete probability 
distribution, 9 = {9i, 62, ... , 9k}, given a limited number of 
samples drawn from that distribution, summarized by the count 
vector n = {ni, n,2, . . . , n.fc}, and a reasonable a priori best guess 
for the distribu tion 9 sa tt = {771 , 71?, . . . ,7Tfc}. (For a general in- 
troduction, see Dur bin et q/.ll9 98.) This guess may simple be the 
uniform probability, TTj — l/k, which amounts to asserting that, 
as far as we know, all possible observations are equally likely. At 
other times, we may know some some more detailed approxima- 
tion to the distribution 9. 

For example, in the present case we wish to estimate the prob- 
abilities of substituting a pair of amino acid residues by another 
residue pair, given the number of times that this substitution 
has been observed in the training dataset. This probability is 
hard to estimate reliably since the distribution is very large with 
20 4 = 160, 000 dimensions. Moreover, many of the possible 
observations occur very rarely. However, substitutions at differ- 
ent sites are not strongly correlated, and therefore we may ap- 
proximate the doublet substitution probabilities by a product of 
single substitution probabilities. Since the dimensions of these 
marginals are relatively small we can accurately estimate them 
from the available data, and thereby construct a reliable and rea- 
sonable initial guess for the full doublet substitution distribution. 

In the common and conventional pseudocount approach, we 
assume that the distribution tt was estimated from A previous ob- 
servations. These pseudocounts, on — ttiA, are then proportion- 
ally averaged with the real observations (N = J^i n i) t0 provide 
an estimate of 9; 

9i = (14) 

1 A + N 

This prescription is intuitively appealing. When the total num- 
ber of real counts is much less than the number of pseudocounts 
(N <C A) the prior dominates, and the estimated distribution is 
determined by our initial guess, 9 rs tt. In the alternative limit 
that the real observations greatly outnumber the pseudocounts 
(N ^S> A) the estimated distribution is given by the frequencies 
&i = n%/N. However, it is not immediately obvious how to se- 
lect A, although many heuristics have been pro posed, includin; 
A = 1, A = k (Laplace) , and A = y^/V (e . g. [Lawreni 
I1993L iDurbinef a/l ll998t iNemenman et ali l200lh . Essentially, 
this total pseudocount parameter represents our confidence that 
the initial guess 9 sa tt is accurate, since the larger the total pseu- 
docount the more data is required to overcome this assumption. 

Within a Bayesian approach we can avoid this indeterminacy 
by admitting that, a priori, we do not know how confidant we are 
that tt approximates 9. The probability P(n\9) of independently 
sampling a particular set of observations, n, given the underlying 
sampling probability, 9, follows the multinomial distribution, the 
multivariate generalization of the binomial distribution; 



M(n\9) 



1 



M(n) 



11^% M(n) = 



a 



!=1 



(Ei«i)! 



(15) 
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The prior probability of the sampling distribution P(9) is typ- 
ically modeled with a Dirichlet distribution, 



v ; i=i 



z K a ) = r ^ • (16) 



where ^ 6 = 1, ctj > and A = ^ - a*. Note that the mean of 
a Dirichlet is 

E[0i] = ^. (17) 

Therefore, we may fix the parameters of the Dirichlet prior by 
equating our initial guess, n, with the mean prior distribution: 
7r = a/ A If we can fix the scale factor A, then we can combine 
the prior and observations using Bayes' theorem. 



P{9\n) 



P(n\6)P{6) 



(18) 



Because the multinomial and Dirichlet distributions are natu- 
rally conjugate, the posterior distribution P(9\n) is also Dirichlet. 

P{6\n) oc M(n\9)V(9\An) 

k 

n a (Am+m-l) 



i=l 

T>(9\An + n) 



(19) 



The last line follows because the product in the previous line is an 
unnormalized Dirichlet with parameters (An + n), yet the prob- 
ability P(9\n) must be correctly normalized. 

Given multinomial sampling and a Dirichlet prior, the prob- 
ability of the data is given by the under-appreciated multivari- 
ant negative hyperg eometric distribution (Lfohnson & Kotzt[l969t 
lDurbinefaZ.l fl998.Ea. 11.23); 



P(n) 



d9 P{n\0)P(0), 
d9 M(n\9)V(9\An) 



Z(Att) M(n) 



del[e. 



(Awj+ni-l) 



i=l 



Z(Air + n) s 
= H(n\Aw + n) 



Z(An)M(n) 



(20) 



Again, the last line follows because the product in the previ- 
ous line is an unnormalized Dirichlet with parameters (An + n). 
Therefore, the integral over 9 must be equal to the corresponding 
Dirichlet normalization constant, Z(Air + n). Note that, con- 
fusingly, the negative hypergeometric distribution is sometimes 
called the inverse hypergeometric, an entirely different distribu- 
tion, and vice versa. 

Since we do know a reasonable value for the scale factor A 
we cannot use a simple Dirichlet prior. As an alternative, we 
explicitly acknowledge our uncertainly about A by building this 
indeterminacy into the prior itself. Rather than a single Dirichlet, 
we use the Dirichlet mixture; 



P(9\n) 



dAV(9\An)P(A). 



(21) 



The distribution P(A) is a hyperprior, a prior distribution placed 
upon a parameter of the Dirichlet prior. Following the same math- 
ematics as Eas. 1181201 we find that the posterior distribution is the 
Dirichlet mixture 



where 



P(9\n) 



P(A\n) = 



dAT>(9\ATr + n)P(A\n) , 



P(A)H'(n\Air + n) 
J™dAP(A)H'(n\An + n) 



(22) 



(23) 



In principle, we have to select and parameterize a functional 
form for the hyperprior, P(A). For example, an exponential dis- 
tribution, P(A) = A cxp(— XA), with mean 1 /A, might be appro- 
priate. Fortunately, we can often avoid selecting an explicit hy- 
perprior. In practice, given sufficient data, the probability of that 
data P(n\A) is a smooth, sharply peaked function of A This is il- 
lustrated in figure|5]using 10 7 observations of the 160,000 dimen- 
sional doublet substitution probability, where the mean prior dis- 
tribution is taken to be the product of singlet substitutions prob- 
abilities. If the prior distribution of A is reasonable, and neither 
very large nor very small over the range of interest, then the poste- 
rior distribution P(A\n) will also be very strongly peaked. More- 
over, the location of that peak will be almost totally independent 
of the prior placed on A In this limit the posterior Dirichlet mix- 
ture (Ea. l22> reduces to the single component that maximizes the 
probability of the data; 

P(9\n) re V(9\An + n), 

A = axgmaXj^P(A\n) pa &rgm.&x A P(n\A) , 
P(n\A) = H'(n\An + n). (24) 

Here, argmax 2 ./(a;) is the value of x that maximizes that function 
fix). 

Given any function of 9, the average of the function across 
the posterior distribution (the posterior mean estimate (PME) or 
Bayes' Estimate) minimizes the mean squared error of that es- 
timate. In particular, the posterior mean estimate of 9 (Eq. I17l > 

is 

/iPme A-Kj + rij 

9 * = ^"TaT- (25) 

Taken altogether, our practice is to take the raw doublet sub- 
stitution counts and construct a mean prior distribution tt based 
upon the approximation that substitutions on neighboring sites 
are uncorrected. We then find the scaling factor A that max- 
imizes the negative hypergeometric probability H'(n\Air + n). 
For our data the total number of observations N is around 
10 7 , for which the optimal scale factor A was found to be 
about 10 6 . The posterior mean estimate of the doublet substi- 
tution distribution is then used to construct the doublet substi- 
tution matrix. Code for constructing doublet substitution ma- 
trices using this procedure and for finding the optimal prior 
and posterior, given any set of observations and tt, a best 
guess for the true distribution 9, is available from our web site 
(http : / /compbio . berkeley . edul, along with other code 
and data for this work. Our programs make ex tensive use of 
the Open Sourced GNU Scienti fic Library (GSL) ( iGoughlEoolt 
Matsumoto & Nishimura, 1998). 
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Figure 5: The likelihood of observations as a function of the 
scale parameter A. With multinomial sampling and a Dirichlet 
prior the likelihood of the data follows the negative hypergeo- 
metric distribution, H 1 (n | Aw + n), where n is the count vector of 
observations, it is the mean prior estimate of the sampling distri- 
bution, and A is a scale parameter (Ea.l20>. Given a large number 
of observations (here, N — ^ m is about 10 7 ) the probability of 
the data is a smooth and very sharply peaked function of the scale 
parameter A. 
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